ABSTRACT

The controller protein C.Esp1396I regulates the
timing of expression of the restriction-modification
(RM) genes of the RM system Esp1396I. The
molecular recognition of promoter sequences by 
such transcriptional regulators is poorly under- 
stood, in part because the DNA sequence motifs 
do not conform to a well-defined symmetry. 
We report here the crystal structure of the controller 
protein bound to a DNA operator site. The structure 
reveals how two different symmetries within the 
operator are simultaneously recognized by the 
homo-dimeric protein, underpinned by a conform- 
ational change in one of the protein subunits. 
The recognition of two different DNA symmetries 
through movement of a flexible loop in one of the 
protein subunits may represent a general mechan- 
ism for the recognition of pseudo-symmetric DNA 
sequences. 

INTRODUCTION 

Restriction-modification (RM) systems play a central role 
in modulating the horizontal transfer of genes in bacterial 
populations and thus in the transmission of antibiotic re- 
sistance between bacterial species (1). An understanding of 
the molecular mechanisms of gene regulation in RM 
systems, and their impact on the flow of genetic informa- 
tion in bacterial populations, is thus of great interest. 

RM systems encode a restriction endonuclease (ENase) 
and a DNA methyl transferase (MTase). The sequence- 
specific DNA methyltransferase protects the host DNA 
from cleavage by the associated restriction enzyme, and 
the specific methylation pattern of the host RM system 
allows the discrimination of 'self' from 'non-self' DNA
(2). Premature expression of the endonuclease prior to 
protection of the host DNA by the methyltransferase 



would be lethal. Thus, there are a variety of control mech- 
anisms that ensure the correct temporal expression of RM 
genes. In many systems, this is accomplished by means of 
a 'controller' (C) protein encoded by a gene downstream 
of its own promoter, and the C-gene is co-transcribed with 
the endonuclease (R) gene as a single transcriptional unit 
(3-7). The C-protein binds at various sites within the C/R 
promoter to regulate transcription of its own gene and the 
associated endonuclease gene (8). 

Measurements of C-dependent transcriptional activity 
in vitro, together with mathematical modelling of the 
gene control circuits, have shown the time dependence of 
the activity of this switch (9). In vivo experiments have 
directly demonstrated a time lag in the expression of the 
ENase with respect to the MTase when the C-protein is 
expressed in a new host (10). 

In most C-protein systems so far investigated, the
operator sequence at the C/R promoter has two binding sites
(denoted OL and OR) that can accommodate two
C-protein dimers (11,12). OL is distal to the gene and
has the higher affinity for a C-protein dimer. OR is
proximal to the gene and the intrinsic affinity for this
site is weak; however, when a C-protein dimer is bound
to OL, the affinity for OR increases around 1000-fold
(12,13).
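The cooperative occupancy of the two sites can be illustrated with a simple equilibrium model. The sketch below is not taken from the cited work: the dissociation constants are hypothetical placeholders, and only the ~1000-fold cooperativity factor comes from the text above.

```python
def promoter_state_fractions(dimer_nM, k_left_nM=10.0, k_right_nM=10000.0,
                             cooperativity=1000.0):
    """Fractional occupancy of the four promoter states at equilibrium.

    k_left_nM and k_right_nM are HYPOTHETICAL intrinsic dissociation
    constants for OL and OR; cooperativity is the factor by which a dimer
    at OL strengthens binding at OR (~1000-fold according to the text).
    """
    w_free = 1.0
    w_left = dimer_nM / k_left_nM            # dimer on OL only
    w_right = dimer_nM / k_right_nM          # dimer on OR only
    w_both = cooperativity * w_left * w_right  # dimers on OL and OR
    total = w_free + w_left + w_right + w_both
    return {
        "free": w_free / total,
        "OL_only": w_left / total,
        "OR_only": w_right / total,
        "both": w_both / total,
    }


if __name__ == "__main__":
    for conc in (1.0, 10.0, 100.0):
        print(conc, promoter_state_fractions(conc))
```

With these placeholder constants the doubly occupied (repressed) state dominates at high dimer concentration only when the cooperativity term is included, which is the qualitative behaviour described for the switch.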

Earlier biochemical and biophysical analysis in our
laboratory suggested the basis of the genetic switch in
AhdI (11-15). Low-level expression of the C-protein
from a weak promoter leads to a delay in transcription
until sufficient protein accumulates to form a functional
dimer. The C-protein dimer activates transcription of the
C/R operon, forming a positive feedback loop, which
leads to a rapid increase in C-protein expression; at
higher concentrations, a second dimer is recruited to the
promoter, displacing the σ subunit of RNA polymerase
and thereby repressing transcription of its own gene
(and hence expression of the R gene) in a negative
feedback loop (Figure 1). A similar, but more complex,
mechanism has been proposed for the R-M system
Figure 1. The conserved DNA sequences obey different symmetries.
(a) The C-boxes are symmetrical if the pseudo-dyad axis is placed
between the central GT. (b) If the pseudo-dyad axis is placed at the
central T, the C-boxes are no longer symmetrical but the other
conserved elements are. The pseudo-dyad axes within operators (blue)
and between operators (red) are shown as dotted lines. Figure
reproduced from McGeehan et al. (2008). (c) C-proteins act as a
genetic switch regulating the timing and expression of R-M genes.
The −35 (green) and −10 (red) regions are indicated upstream of the
C-gene (light blue) and the R-gene (data not shown). C-protein dimers
are shown in blue and the sigma subunit of RNA polymerase in
orange. C-protein is expressed at low levels from a weak C-independent
promoter (data not shown). A C-protein dimer first occupies the
high-affinity OL site and stimulates transcription of the C-gene
through recruitment of the RNA polymerase sigma subunit to the −35
site. As the C-protein concentration increases, a dimer occupies the
OR site and occludes the −35 site, down-regulating the expression of
the C- and R-genes. Adapted from McGeehan et al. (11).



Esp1396I. Experiments conducted in collaboration with
Severinov and colleagues (16) have shown that in the
R-M system Esp1396I, the C-protein, in addition to regulating
the R gene, represses the M gene by binding as a
dimer to a high-affinity site that overlaps the transcriptional
start site.

Bioinformatic analysis of known and potential
C-protein binding sites has identified a repeating symmetrical
'consensus' sequence consisting of four quasi-symmetrical
'C-boxes' GACT–AGTC–GACT–AGTC
upstream of the C/R genes in a wide variety of
R-M systems (6,8), and a similar sequence is found
within the 35-bp sequence that has been identified by
DNA footprinting in AhdI (12). However, the degree of
sequence homology between species is moderate and the
internal symmetry between 'C-boxes' is far from perfect
(Figure 1a). The GT dinucleotide in the centre of the
proposed consensus sequence is in fact more highly
conserved than the proposed GACT tetranucleotide recognition
sequences (8), but clearly lacks dyad symmetry.
The proposed 3-bp 'spacers' within the left and right
operator sequences are equally well conserved, the consensus
sequence being TAT.

We initially solved the structure of the controller protein
C.Esp1396I bound as a tetramer (i.e. two dimers) to its
35-bp operator sequence, the first DNA-protein structure
for any C-protein complex (11). The structure of
the nucleoprotein complex shows the molecular basis of
cooperative binding, consisting of protein-protein electrostatic
contacts between dimers, together with structural
changes in the DNA that facilitate binding of the second
dimer. In the crystal structure of the C.Esp1396I/35-bp
operator complex (PDB code: 3CLC), the pseudo-dyad
axis relating the two operators is shifted by half a base
(i.e. centred on T rather than the expected GT). Although
the pseudo-dyad between GACT/AGTC sequences is then
lost, there are instead perfectly symmetrical TATA
sequences at the centre of each operator (Figure 1b).

The structure suggested the mechanism whereby 
cooperative binding of dimers to the DNA operator 
governed the switch from activation to repression of the 
C and R genes (11). The overall structure of the complex 
comprises two dimers bound to the DNA, each centred on 
the pseudo-dyad located between the central A and T 
bases in the TATA sequence that is found at the centre 
of each operator site. The two dimers are bound to ap- 
proximately opposite faces of the DNA. Each dimer bends 
the DNA by ca. 50°, and inserts helix-3 of the classical 
HTH motif into the major groove of DNA, either side of 
the central TATA within each operator. In this structure, 
the two protein dimers are related by a dyad axis that 
coincides with the pseudo-dyad axis lying within the 
central T:A base pair of the 35-bp duplex. 

Some clear protein-DNA interactions were also identi- 
fiable, in particular the interaction of R35 with the 
conserved G3 on both DNA strands. However, the struc- 
ture was relatively low resolution and, moreover, 
since both orientations of the DNA were present in the 
asymmetric unit, the resulting structure was symmetry- 
averaged, which precluded a detailed analysis of the 
protein-DNA contacts. 

In order to clearly identify the protein-DNA interactions,
and thus determine the molecular basis of DNA
sequence recognition (and in particular how deviations
from symmetry within the DNA recognition site
are accommodated), we have crystallized a C.Esp1396I
dimer bound to a single dimer binding site, OL
(the stronger of the two binding sites that are located
upstream of the endonuclease gene). The DNA sequences
employed were designed with overhanging bases, to facilitate
intermolecular packing via end-to-end stacking of
unpaired bases. What we found, however, was a completely
novel packing interaction, in which the bases
formed DNA triplets between adjacent DNA duplexes.



MATERIALS AND METHODS 

Purification 

Purification of C.Esp1396I was carried out as previously
described (11). Briefly, large-scale cultures of E. coli
BL21(DE3) containing the plasmid pET-28b/esp1396IC
were grown. Over-expressed C.Esp1396I was harvested
by sonication and purified using nickel affinity chromatography.
The N-terminal hexa-histidine tag was removed
by thrombin digestion but the purified protein retained
a GSH tripeptide. DNA oligonucleotides were purified as
previously described and annealed to form a duplex (11).

Crystallization 

C.Esp1396I was incubated with an 18-bp duplex DNA at
varying ratios prior to crystal screening. Selected crystals
were formed in the PACT Premier 67 condition (0.2 M
sodium acetate, 0.1 M Bis-Tris-propane, 20% PEG3350),
at pH 5.0 in 4 µl hanging drops (2 µl protein/DNA and 2 µl
mother liquor) at 16°C. The final protein:DNA ratio was
2:1 (monomer:DNA) at ~30 nM final DNA concentration.
The crystals were mounted in litholoops (Molecular
Dimensions), cryoprotected in 35% v/v ethylene glycol
and cryo-cooled in liquid nitrogen.

Structure solution and refinement 

Data were collected from cryo-cooled crystals of the
19-mer complexes on ID14-4 at the ESRF (Grenoble) at
100 K using an ADSC 4Q CCD detector. The complex
crystallized in the space group P22₁2₁; reflections
extended to ~1.9 Å and 110 images were collected with
an oscillation angle of 1°. The data were processed and
scaled using XDS/XSCALE (17) with an overall Rmerge of
11% and an overall completeness of 90.7% at 2.1 Å
(Table 1). The scaled data were phased by molecular replacement
with the C.Esp1396I dimer bound to the left
operator within the tetrameric complex (chains A and B, 
residues 5-75, chain C, residues 6-13 and chain D, 
residues 23-30) as the search model using Phaser (18). 
Iterative refinement was carried out using Refmac5 (19) 
with TLS restraints enabled (Table 1). The missing DNA 
bases were manually added into interpretable electron 
density using Coot (20), as were 11 of the 16 missing 
terminal amino acid residues. Waters were added during 
refinement with Refmac5 and checked manually. The final 
structure contained all 38 bases and amino acid residues 
2-77 and 3-79 in chains A and B, respectively. The final 
parameters used during refinement are shown in Table 1. 
The DNA base pair parameters were calculated using the 
software package CURVES+ (http://gbio-pbil.ibcp.fr/cgi/ 
Curves_plus). The coordinates of the nucleoprotein
complex have been deposited in the Protein Data
Bank (PDB code 3S8Q).



RESULTS 

Structure solution 

The DNA sequence (an 18-bp duplex with an overhanging 
base at the 5'-end of each strand) was designed to aid the 
formation of pseudo-continuous DNA in a single orienta- 
tion and thus overcome the symmetry-averaging problems 
encountered in the tetramer complex structure (11). The 
averaging problem was indeed overcome in this structure, 
but the DNA did not form a pseudo-continuous helix.
Instead, the DNA ends are involved in crystal packing
interactions between symmetry-related molecules and
form triple helical interactions (Figure 2). The terminal
two bases are paired on both the Hoogsteen and
Watson-Crick edges to form a base 'triplet' at both ends
of the DNA (T-AT and A-GC), which maximizes base
stacking. These triple helical interactions help to stabilize
the DNA ends, which refined with low B-factors, despite
not being involved in protein-DNA interactions.
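The pairing scheme of the 19-mer (18 Watson-Crick base pairs plus one unpaired 5' base on each strand) can be checked directly from the sequences printed in Figure 3. The following is a minimal sketch; the function name is illustrative, and it assumes that Figure 3 prints chain D in the 3'-to-5' direction, as its primed numbering suggests.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Chain C as printed in Figure 3 (5'->3', residues 1-19).
chain_c = "ATGTGACTTATAGTCCGTG"
# Chain D is printed 3'->5' (19' ... 1'), so reverse it to read 5'->3'.
chain_d = "acactgaatatcaggcact"[::-1].upper()


def watson_crick_pairs(strand_a, strand_b, overhang=1):
    """List (i, j, base_i, base_j) for the paired region of two strands of
    equal length that each carry `overhang` unpaired bases at the 5' end.
    Positions are 1-based and counted 5'->3' on each strand."""
    n = len(strand_a)
    pairs = []
    for i in range(overhang + 1, n + 1):
        j = n + overhang + 1 - i  # partner position on the other strand
        pairs.append((i, j, strand_a[i - 1], strand_b[j - 1]))
    return pairs


pairs = watson_crick_pairs(chain_c, chain_d)
print(len(pairs), "base pairs")
print("unpaired 5' bases:", chain_c[0], "(chain C) and", chain_d[0], "(chain D)")
print("terminal pairs:", pairs[0], pairs[-1])
```

The terminal pairs T2·A'19 and G19·C'2 are the ones that accept the overhanging T'1 and A1 of symmetry-related molecules, giving the T-AT and A-GC triplets mentioned above.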

One complex, consisting of a C.Esp1396I dimer bound
to a DNA duplex (Figure 3), is present in the asymmetric
unit and the structure refined to 2.1 Å with a final R/Rfree
of 16.8/22.4% (Table 1). Iterative refinement was carried
out using Refmac (19) with TLS restraints enabled. All of
the DNA bases are clearly resolved, as are all except a few
amino acid residues at the N and C termini of each protein
subunit. In addition, a total of 314 solvent molecules could
be located, including a number of water molecules
mediating protein-DNA interactions.

From a superposition of the dimeric OL complex and
the appropriate region of the tetrameric complex
(Supplementary Figure S1), the protein and the DNA
components of both complexes are for the most part identical,
although the side-chains of the protein and the bases
of the DNA can be positioned with far greater reliability
in the dimeric complex. One major difference, however, is
that one region of the protein (residues 43-47) exhibits a
different conformation in each subunit in the dimeric
complex. This is probably also the case in the tetrameric



Table 1. X-ray crystal data, refinement parameters and model
statistics for the 19-mer OL complex structure

Data collection
  Space group                    P22₁2₁
  Unit-cell parameters (Å, °)    a = 44.3, b = 61.5, c = 113.7; α = β = γ = 90
  Resolution limits (Å)          50-2.1 (2.2-2.1)
  Rmerge* (%)                    11.0 (37.1)
  I/σ(I)                         15.3 (4.6)
  Completeness (%)               90.7 (86.9)
Refinement parameters
  Scaling                        Babinet
  TLS
    No. of groups
    Description                  Individual chains
Refinement model statistics
  No. of reflections             17 096
  Rcryst/Rfree* (%)              16.8/22.4
  No. of atoms
    Protein                      1279
    DNA                          773
    Water                        314
  Average B-factors (Å²)
    Protein                      31.85
    DNA                          35.49
    Water                        47.29
  RMS deviations from ideal
    Bond lengths (Å)             0.01
    Angles (°)                   1.3

Values in parentheses are for the highest resolution shell.
*$R_{\mathrm{merge}} = \sum_{hkl}\sum_{i} |I_i(hkl) - \langle I(hkl)\rangle| / \sum_{hkl}\sum_{i} I_i(hkl)$, where $\langle I(hkl)\rangle$ is the
mean intensity of reflection $I(hkl)$ and $I_i(hkl)$ is the intensity of an
individual measurement of reflection $I(hkl)$.
$R_{\mathrm{cryst}} = \sum_{hkl} \bigl| |F_{\mathrm{obs}}| - |F_{\mathrm{calc}}| \bigr| / \sum_{hkl} |F_{\mathrm{obs}}|$, where $F_{\mathrm{obs}}$ is the observed structure-factor
amplitude and $F_{\mathrm{calc}}$ is the calculated structure-factor amplitude. $R_{\mathrm{free}}$
is the same as $R_{\mathrm{cryst}}$ but for the 5% of structure-factor amplitudes that
were set aside during refinement.
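
The R-factor definitions in the Table 1 footnote translate directly into a few lines of code. The sketch below only illustrates the formulas; the intensities and amplitudes are made-up numbers, not data from this structure, and the function names are illustrative.

```python
def r_merge(observations):
    """R_merge = sum_hkl sum_i |I_i - <I>| / sum_hkl sum_i I_i.

    `observations` maps a reflection index (h, k, l) to the list of its
    individually measured intensities I_i(hkl).
    """
    numerator = 0.0
    denominator = 0.0
    for intensities in observations.values():
        mean_intensity = sum(intensities) / len(intensities)
        numerator += sum(abs(i - mean_intensity) for i in intensities)
        denominator += sum(intensities)
    return numerator / denominator


def r_cryst(f_obs, f_calc):
    """R_cryst = sum ||F_obs| - |F_calc|| / sum |F_obs|.

    R_free uses the same expression, evaluated only on the reflections
    that were set aside during refinement.
    """
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    return numerator / sum(abs(o) for o in f_obs)


# Made-up example values, for illustration only.
example = {(1, 0, 0): [100.0, 110.0], (0, 1, 0): [50.0, 50.0, 56.0]}
print(r_merge(example))
print(r_cryst([100.0, 50.0], [90.0, 55.0]))
```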
Figure 2. Triple helical DNA interactions between symmetry-related 19-mer complexes. (a) The triple helical interactions occur between A1 and G19
of chain C (blue and beige, respectively) and T'1 and A'19 of chain D (pink and green, respectively). (b) A cartoon representation depicting how the
Watson-Crick edge of the overhanging 5' base (either T'1 or A1) forms hydrogen bonds with the Hoogsteen edge of the terminal base of a symmetry-related
molecule (A'19 or G19, respectively).




          1   5    10   15  19
chain C : ATGTGACTTATAGTCCGTG
chain D :  acactgaatatcaggcact
           19' 15'  10'  5'  1'

Figure 3. The C.Esp1396I OL nucleoprotein complex structure. The
asymmetric unit of the crystal contains a C.Esp1396I dimer (chains A
and B; blue and green, respectively) bound to DNA (chains C and
D; beige and pink, respectively). The DNA duplex consists of 18 bp
with a 5' overhang on each strand.



complex, but the combined averaging effects, together
with the low resolution of the data, resulted in smeared
density in these areas and prevented unambiguous refinement.
These two alternative loop conformations were first
identified in the free protein structure, where there are
seven dimers in the asymmetric unit (21). Two of the
14 subunits in the asymmetric unit adopted a different
conformation to the other 12, and the two conformations
are likely to be of comparable stability. In the case of the
OL nucleoprotein complex, both conformations can
be found in the same dimer; in monomer A, the loop
adopts the 'major' conformation and in monomer
B, it adopts the 'minor' conformation, as can be clearly
seen in the electron density (Supplementary Figure S2).
The different loop conformations in the DNA-protein
complex may reflect the departure from true dyad
symmetry in the OL operator sequence (Figure 1).

Analysis of protein-DNA interactions 

The recognition helix. The recognition helix (residues
35-43) of each subunit inserts into the major groove of
the DNA. Two residues in this helix, R35 and T36, are 
involved in direct readout of the DNA sequence while 
other residues are involved in non-sequence-specific inter- 
actions with the phosphate backbone (Supplementary 



4162 Nucleic Acids Research, 2012, Vol. 40, No. 9 



Figures S3 and S4). In both subunits of the dimer, the
γ-hydroxyl of T36 H-bonds to the N4 group of a
cytosine; however, T36 in chain A recognizes C'16 of one
DNA strand and the T36 in chain B recognizes C15 of the
other strand.

The amino groups of R35 in each subunit interact with
the N7 and O6 of a guanine. R35 in chain A recognizes G3
on one DNA strand (Figure 4a, b). However, the R35 in
chain B cannot make the symmetry-equivalent interaction
on the other strand, as an adenine (A'3) rather than a
guanine is in the equivalent position in chain D. Instead,
this R35 interacts with the N7 and O6 of G17 on chain
C (Figure 4c); it is clear that the flexible side-chain of
arginine is capable of accommodating the departure
from dyad symmetry in the OL DNA sequence. In
addition to hydrogen bonding with G3, R35 in chain
A is involved in indirect readout by stacking of the
planar guanidinium group with the exposed face of T2
as described elsewhere (11) and is thus specific for a TG
dinucleotide at this position. There is no equivalent
stacking of R35 of chain B, however, since there is no
equivalent TG dinucleotide at this site (there is an
adjacent T but it is on the 3' side of the G, which therefore
adopts a very different base stacking pattern).

Helix 3 of C.Esp1396I recognizes bases within the
5'-AGTC sequence (the 'bottom' strand), rather than the
5'-GACT on the complementary ('top') strand. Each recognition
helix of the dimer makes these two direct
sequence-specific interactions, one (from each subunit) to
the highly conserved G3 or G17 base outside of the C-box
and another to the cytosine complementary to the G in the
AGTC sequence. In addition, however, the G and T bases
in this sequence are recognized by R46, which lies within the
flexible loop region (as discussed below). The two
C.Esp1396I monomers are able to make non-symmetrical
interactions with the DNA due to the flexible nature of the
R35 side-chain, and can thus adapt to the asymmetrical
location of the C-boxes in the OL DNA sequence.

The alternative loop conformations. In the free protein 
structure, two alternative conformations of the loop 
region (residues 43-46) between helices 3 and 4 were
observed (21). Comparing the two conformations, the 
side-chains of N44 and S45 are flipped almost 180° 
about the peptide backbone. In the minor conformation, 
it was postulated that the polar head groups of the aspara- 
gine (N44) sidechain would be in close proximity to 
the DNA and may be involved in protein-DNA inter- 
actions (21). 

Figure 5 shows both loop conformations with respect to
the DNA, and the atoms involved (see also Supplementary
Figure S5). The side-chain of N44 in the major loop
(Figure 5a) points towards the core of the protein and
the terminal carbonyl and amino groups are stabilized
directly through interactions with the backbone amino
group of S7 and the γ-hydroxyl of S10. The δ-amino of
N44 and the backbone carbonyl of S7 also coordinate a
water. These interactions provide stability for the major
loop conformation. In the minor loop conformation
(Figure 5b) the N44 rotates ~180° about the backbone
and the side-chain points towards the DNA backbone.
The δ-amino of N44, the η-amino of R43 and the phosphate
oxygen of G5 coordinate a water that stabilizes the
N44 side-chain.

Although the side-chain of S45 is rotated ~180°
between the two loop conformations, in both instances
the γ-hydroxyl is stabilized by interactions with other
amino acid residues. In the major loop conformation,
the γ-hydroxyl interacts with the backbone carbonyl of a
glycine (G4) while in the minor loop conformation the
Figure 4. Direct and indirect readout of DNA by R35. (a) The R35 recognizes the G3 and G17 in the DNA sequence and the highlighted bases
(beige: chain C and pink: chain D) are shown in b and c. (b) The R35 from chain A recognizes the conserved TG by interacting with the O6 and N7
of the guanine base. The planar guanidinium group of R35 also stacks with the thymine base. (c) The symmetry-related interaction cannot be made
by chain B, which instead recognizes G17 via the O6 and N7. All hydrogen bond distances are <3.2 Å.
Figure 5. Comparison of the interactions made by the flexible loop region in chains A and B of the 19-mer OL complex structure. The hydrogen
bond interactions made by the flexible loop regions in chains A and B (a and b, respectively) are shown as black dashes. Residues involved in
stabilizing the loop region are represented as thin lines. Water molecules are represented by green spheres.



γ-hydroxyl interacts with the γ-hydroxyl of a serine (S10)
and also coordinates a water. The R43 interacts with the
phosphate backbone in both loop conformations and the
polar groups of N44 are also stabilized in both conformations,
albeit by different groups.

In both conformations of the loop region, arginine 46 is
involved in direct readout of the DNA sequence. In the
major loop conformation (monomer A), R46 interacts
with the N7 and O6 of G'14 from chain D, as well as
H-bonding to the O4 of the adjacent T'15 (Figure 5a and
Supplementary Figure S5a). Only guanine and thymine
have electron donors in the correct position to interact
with the amino groups of this arginine; thus the recognition
of these two bases confers specificity. In the minor
loop conformation (monomer B), R46 is involved in interactions
with the G13 and T14 of chain C (Figure 5b and
Supplementary Figure S5b). However, the head group of
the arginine is positioned in such a way as to only directly
interact with the N7 of G13. This interaction is sufficient to
distinguish between purine and pyrimidine but is unable to
distinguish between adenine and guanine. The η-amino
group of R46 also makes a water-mediated contact with
the O4 of T14. As water can act as either a donor or an
acceptor in hydrogen bonding, this interaction cannot distinguish
thymine from cytosine. This water forms part of
the network of highly organized waters found in the major
groove of the DNA.

Symmetry of the protein subunits and DNA. Figure 6a 
shows a superposition of subunits A and B by a rotation 
about the dyad axis that relates them. The overall 
backbone structure of the two subunits is very similar, 
with the only notable difference occurring in the flexible 
loop region. The GTC motif recognized by the amino acid
side-chains of T36 and R46 shifts by approximately half a
base pair relative to the protein. The flexible loop is able to
accommodate this half base pair shift, permitting
recognition of the GTC by monomer B. In an alternative
view, if the GTC bases that are recognized by each subunit
(one in each half-site) are superimposed (Figure 6b), the
protein rotates by ~30° around the helix axis of the DNA.

DNA structure and backbone interactions. Analysis of the
OL DNA structure in the complex was performed using
CURVES (22). The minor groove at the TATA site is
compressed (from ~7 Å to ~2 Å), which leads to the
DNA being significantly bent about this sequence, as
shown in Figure 7. The overall bend of the DNA duplex
is ~40°. From circular permutation assays, it is clear that
the DNA is not intrinsically bent; rather, the bending is
induced when the C-protein binds to its operator, i.e. the
sequence of the operator DNA is one that can readily be
deformed when the C-protein binds (11,14). DNA bending
around the TATA site permits a form of indirect readout.
The bend in the DNA at the TATA site is accompanied by
deviations in the base pair and step parameters (Figure 8
and Supplementary Figure S6). The parameters for the
OL sequence show values that are typical of TATA sequences
(23) except for the roll parameter, which is closer
to standard B-form DNA. The twist values for the two
thymines differ significantly from the standard B-form
values, suggesting that the DNA bending can be
achieved by partially unstacking the TATA bases.
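
The local bend angle plotted in Figure 8 is described in its caption as the angle between the normals of adjacent base pairs. A minimal sketch of that geometric definition is given below; the normal vectors are hypothetical inputs, and this is not the procedure implemented in CURVES+ itself.

```python
import math


def angle_between_normals(n1, n2):
    """Angle in degrees between two base-pair normal vectors (3-tuples)."""
    dot = sum(a * b for a, b in zip(n1, n2))
    len1 = math.sqrt(sum(a * a for a in n1))
    len2 = math.sqrt(sum(b * b for b in n2))
    # Clamp to [-1, 1] to guard against rounding just outside the domain.
    cos_theta = max(-1.0, min(1.0, dot / (len1 * len2)))
    return math.degrees(math.acos(cos_theta))


# Hypothetical normals: the second base pair is tilted by 10 degrees.
tilt = math.radians(10.0)
first_normal = (0.0, 0.0, 1.0)
second_normal = (0.0, math.sin(tilt), math.cos(tilt))
print(angle_between_normals(first_normal, second_normal))
```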

The tetrameric complex structure suggested a possible 
role for Y37 and S52 of each subunit in the dimer in com- 
pressing the minor groove at the TATA sequences by 
binding to the phosphate backbone of the DNA on 
either side (11). These interactions are seen much more 
clearly in the OL complex structure, with additional
backbone contacts being made by N47 (Figure 7). For 
each subunit, the hydroxyl groups of Y37 and S52 
interact with the same phosphate group of the DNA 
(5' of nucleotide 13 in both DNA strands) and the 
amino group of N47 interacts with the phosphate 5' of 
Figure 6. Comparison between the two half sites in the 19-mer OL complex structure. (a) Monomers A and B (blue and green, respectively) were
superimposed with RMSD = 0.34 Å (222 main chain atoms). The DNA bases are offset by approximately half a base pair. Bases involved in direct
readout are shown as thick lines. The half base pair shift is compensated for through the flexibility of the loop region, permitting recognition of the
GTC bases. (b) Residues 13-15 of chain C (blue) were superimposed upon residues 14-16 of chain D (green) (RMSD = 0.54 Å over 61 backbone
atoms). Hydrogen bonds are represented by dashed lines.




Figure 7. Non-specific DNA contacts stabilize the complex and the compression of the minor groove. Chains A-D are coloured blue, green, beige 
and pink, respectively. Hydrogen bonds are represented by dashed lines. The nucleoprotein complex is stabilized by non-specific interactions between 
amino acids (e.g. R43; inset left and right) and the phosphodiester backbone of DNA. The minor groove is compressed at the TATA sequence, which 
results in the DNA being significantly bent. Y37, N47 and S52 from both monomers are involved in stabilizing the bend (inset centre). 
Figure 8. Compression of the minor groove at the TATA sequence
results in DNA bending. The overall DNA bend is ~41° and the
local bend angle between adjacent base pairs (calculated as the 
angle formed between the normals of adjacent base pairs) is greatest 
at the TATA sequence (red line: minor groove width; blue line: local 
bend angle; dashed line: minor groove width of standard B-form 
DNA). 



nucleotide 12. The serine in chain B has a dual conformation
(both conformations were refined with 50% occupancy),
but both conformations interact with the same
phosphate group. These interactions cause the minor
groove to be compressed and the DNA to be bent. In
addition, there are interactions of the side-chains of
residues R17, N24, S39 and R43 with phosphate groups
at either end of the DNA (Figure 9 and Supplementary
Figures S7 and S8), which further stabilize the bent DNA
conformation. The interactions of the negatively charged
phosphate groups with the positively charged guanidinium
groups of R17 and R43 will be particularly strong, and
should make an important contribution to the overall
binding energy.



DISCUSSION 

A highly conserved inverted repeat with the consensus 
GACT . . . AGTC was originally thought to be the 
binding motif, and this recognition sequence appeared to 
Figure 9. Schematic representation of nucleoprotein interactions. (a) Residues Arg35, Thr36 and Arg46 are involved in direct readout of the DNA
sequence. (b) Overview of protein-DNA interactions. Phosphate groups are represented as circles, and those interacting with the protein are coloured
according to the subunit contacted. Interactions between chain A and the DNA are highlighted in blue and interactions between chain B and DNA
are highlighted in green (for further details, see Supplementary Figures S7 and S8).
be common to a large family of C-proteins (and in most
cases is itself duplicated) in the region upstream of the 
C/R promoter (6,8,10). In addition to these C-boxes, it 
was noted that the flanking TG (and the symmetry 
related CA) is also highly conserved, as are the central 
GT between dimer binding sites and the TAT sequences 
between the C-boxes. 

However, these motifs follow different symmetries
(Figure 1a and b) with the pseudo-dyad axis either
between the central GT (C-box symmetry), or through
the central T, respectively (11). As each of the above
sequence elements is conserved and both symmetries have been
shown to be important in C-protein binding (16), the
protein must be able to accommodate both symmetries.
The structure we now report of the C.Esp1396I controller
protein dimer bound to the OL operator shows how this
dual recognition is achieved.
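
The two competing symmetries can be made concrete with the 19-mer OL sequence of Figure 3. The sketch below is an illustration added for clarity rather than an analysis from the original work: it counts, for a chosen dyad position, which bases of the top strand (chain C) are mirrored by their complement. The function name and the choice of test positions are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Top strand (chain C) of the 19-mer as printed in Figure 3, numbered 1-19.
OL_TOP = "ATGTGACTTATAGTCCGTG"


def dyad_matches(seq, twice_axis, positions):
    """Return the 1-based positions whose mirror image across the dyad
    carries the complementary base.

    twice_axis is the axis position doubled so that it stays an integer:
    20 means the axis passes through base 10, 21 means it lies between
    bases 10 and 11.
    """
    hits = []
    for i in positions:
        j = twice_axis - i  # mirror partner of position i
        if 1 <= j <= len(seq) and COMPLEMENT[seq[i - 1]] == seq[j - 1]:
            hits.append(i)
    return hits


left_half = range(5, 11)  # GACT (5-8) plus the first two bases of TATA (9-10)
on_base = dyad_matches(OL_TOP, 20, left_half)        # axis through A10
between_bases = dyad_matches(OL_TOP, 21, left_half)  # axis between A10 and T11
print(on_base, between_bases)
```

With this sequence, all four C-box positions (5-8) are mirrored only when the axis passes through A10, whereas the TATA positions 9 and 10 are mirrored only when the axis lies between A10 and T11. No single axis satisfies both elements, which is the half-base-pair conflict that the protein dimer has to accommodate.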

The flexible loop that was observed in the free protein
plays a fundamental role in breaking the symmetry of
the protein dimer and permits base-specific interactions
with a variety of DNA sequence motifs that are not completely
symmetric. The GTC bases in the C-box are
recognized by T36 and R46 of each subunit, and the
conserved TG motif (T2/G3) and G17 are specifically
recognized by R35 in the recognition helix of subunits A
and B, respectively (Figure 9). In order for the R46 in
chain B to interact with the DNA bases in the second
half-site, which are displaced from their symmetry-equivalent
positions, the flexible loop adopts the minor
conformation.

From close inspection of Figure 9, it can be seen that
the non-specific interactions of the protein with the phosphate
groups in the DNA backbone follow the symmetry
that has the dyad centred on the TATA sequence, as in
Figure 1b. In contrast, the interactions of the GTC bases
in the C-box motifs follow the symmetry that is centred on
TAT (Figure 1a). The amino acid residue (R46) responsible
for the majority of the interactions with the GTC
motif moves to accommodate the ~1.7 Å shift (and ~18°
rotation) of these bases. This is enabled by the change in
the loop conformation in subunit B, as well as by the
inherent flexibility of the arginine side-chain, resulting in
the relative displacement between the two subunits that
can be seen in Figure 6.
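
The quoted ~1.7 Å shift and ~18° rotation are what a half-base-pair offset would produce in idealized B-form DNA. The short calculation below makes that arithmetic explicit; the helical parameters (3.4 Å rise and 36° twist per base pair, i.e. 10 bp per turn) are textbook values assumed here and are not taken from the refined structure.

```python
# Assumed canonical B-form helical parameters (not measured in this work).
RISE_PER_BP_ANGSTROM = 3.4
TWIST_PER_BP_DEGREES = 36.0  # corresponds to 10 bp per helical turn


def displacement_for_offset(offset_bp):
    """Translation (Å) along and rotation (°) about the helix axis that
    correspond to moving `offset_bp` base pairs along ideal B-form DNA."""
    return (offset_bp * RISE_PER_BP_ANGSTROM,
            offset_bp * TWIST_PER_BP_DEGREES)


shift, rotation = displacement_for_offset(0.5)  # half a base pair
print(f"{shift:.1f} Å, {rotation:.0f}°")
```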

Pseudo-symmetric DNA sequences are common in gene
control regions of DNA, and the asymmetry plays an important
biological role in determining the differential
binding affinity for different promoters (24). In the structural
analysis of such systems, symmetrized operator
sequences have often been employed (25-27). In others
where the natural operators have been used, electron
density around the non-symmetric bases is unclear
[as was indeed the case for the C.Esp1396I tetramer
complex (11)]. However, in the lambda cI repressor,
non-symmetrical interactions can be seen that depend
upon the movement of the flexible N-terminal tail of the
protein (28,29).

In C-protein recognition sites, there are additional
and more pronounced deviations from symmetry, i.e. a
translation between the dyad axis that defines the major
recognition motif GTC/GAC (Figure 1a), and the dyad
axis defining the protein-DNA backbone interactions
(Figure 1b). As discussed above, this displacement
(1.7 Å) together with the resulting ~18° rotation between
the axes defining the two symmetries is accommodated by
a conformational change in one of the subunits, thus
breaking the symmetry of the C-protein homo-dimer in
order to match that of the DNA recognition sequence.
Many other known and putative C-protein recognition
sites are similar to those found in the Esp1396I
restriction-modification system (30) and it would not be
surprising if a similar mechanism of recognition applied in
these cases. Indeed, it is possible that the recognition of
two different DNA symmetries through movement of a
flexible loop in one of the protein subunits may represent
a general mechanism for the recognition of such DNA
binding sites.